Abstract

Trust region algorithms provide a robust iterative technique for solving nonconvex
unconstrained optimization problems, but in many instances it is prohibitively
expensive to compute high accuracy function and gradient values for the method. Of
particular interest are inverse and parameter estimation problems, since function and
gradient evaluations involve numerically solving large systems of differential equations.

We present global convergence theory for trust region algorithms in which neither
function nor gradient values are known exactly. The theory is formulated in a Hilbert
space setting so that it can be applied to variational problems as well as the finite
dimensional problems normally seen in trust region literature. The conditions concerning
allowable error are remarkably relaxed: relative errors in the gradient values of 0.5 or
more are allowed by the theory. One form of the gradient error condition is automatically
satisfied if the error is orthogonal to the gradient approximation. A technique for
estimating gradient error and improving the approximation is also presented.


*This research was supported by the National Aeronautics and Space Administration under NASA
Contract No. NAS1-18605 while the author was in residence at the Institute for Computer Applications in
Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23665.

1 Introduction


An increasingly important area of computational mathematics involves problems requiring 
both numerical simulation and numerical optimization techniques. For example, in the 
design of a large flexible structure such as the space station, engineers may derive an ODE 
or PDE model of the structure based on a number of design parameters, define an objective 
(cost) function for the possible designs based on some criteria of interest (weight, flexibility, 
controllability, cost), and use a numerical optimization routine to find the “best” set of 
design parameters. Often the differential equations involved in the model are not amenable 
to analytic solutions, and therefore the calculation of objective function and gradient values 
in the optimization routine will involve the numerical solution of a system of differential 
equations. Problems of this type are common not only in structural design, but also in 
control, parameter estimation, and image reconstruction, to name but a few areas. A number 
of such problems are surveyed by Minkoff [11]. Among the points stressed in his study are the
very wide range of applications in which these problems are encountered, the computationally 
intense nature of the simulations, and the wide availability of numerical packages such as 
ODEPACK [7] which allow the user to specify the amount of computational accuracy desired 
in the simulation. Clearly, “exact” function and gradient evaluations are not feasible in such 
situations, and low accuracy evaluations may even be desirable in cases where computational 
expense increases very rapidly with increased accuracy in the simulation. Equally clear is 
the fact that sufficiently large errors will cause the optimization algorithm to fail. 

Trust region algorithms for nonlinear optimization have been an increasingly popular 
choice in recent years because of their elegance, efficiency, and robust convergence proper- 
ties. In this paper, we establish global convergence results for a class of these algorithms 
when neither function nor gradient values are computed exactly. The conditions concerning 
allowable error are both natural and exceedingly mild. Although trust region methods are 
most commonly applied to finite dimensional problems, in this paper we emulate Toint [19] 
and present our analysis in a general Hilbert space setting so that the trust region algorithm 
can, in principle, be applied directly to a variational (distributed parameter) problem rather
than to a finite dimensional discretization of the problem developed at an early stage of the 
design process. A comparison of the relative merits of these two approaches is an interesting 
research issue, but is beyond the scope of this paper. 

Synopsis. In Section 2, we define our problem and present the trust region algorithm. 
Our conditions for admissible error in function and gradient values are introduced and briefly 
discussed. In Section 3, we present several properties associated with the computation of 
trial steps from sequences of quadratic models. Using these properties and our conditions 
concerning function and gradient error, we first establish that at least a subsequence of 
the gradient approximations converges to zero, and then the stronger result that both the 
sequence of gradients and the sequence of gradient approximations converge to zero. In 
Section 4, we discuss implementation of the gradient error conditions, and suggest a technique 
for directly estimating the gradient error if other estimates are not available. This technique 
is particularly appropriate if gradients are computed using a finite difference procedure, and 
can also be used to improve the accuracy of a given approximation. In Section 5, we discuss 
a few of the possible generalizations of our theory. In Section 6, we summarize our results. 

2 Preliminaries 

Let $H$ denote a real Hilbert space, and consider the problem

\[ \min_{x \in H} f(x) \tag{1} \]

for some functional $f : H \to \mathbb{R}$. For a given vector $x_0 \in H$, let $\Omega$ be an open convex subset
of $H$ containing the level set of $f$ at $x_0$. We assume

A.1 $f$ is Fréchet differentiable on $\Omega$,

A.2 $f$ is bounded below, and

A.3 the Fréchet derivative of $f$, denoted $\nabla f$, is Lipschitz continuous on $\Omega$.




Our trust region algorithm for solving (1) generates a sequence of iterates $\{x_k\}$ by producing
and approximately solving a sequence of constrained quadratic model problems. That
is, $x_{k+1} = x_k + s_k$ for a step $s_k$ that approximately solves

\[ \text{minimize } \psi_k(x_k + s) : \|s\| \le \Delta_k \tag{2} \]

where $\Delta_k$ is a positive variable known as the trust radius and $\psi_k$ is a quadratic model of the
objective functional $f$ about the point $x_k$. Let $\langle \cdot, \cdot \rangle$ denote the inner product on $H$, and let
$\|\cdot\|$ denote the associated norm or the induced operator norm. Our quadratic model $\psi_k$ will
then be of the form

\[ \psi_k(x_k + s) = f_k + \langle g_k, s \rangle + \tfrac{1}{2} \langle B_k s, s \rangle, \tag{3} \]

where $f_k$ is our approximation to $f(x_k)$, $g_k$ is our approximation to $\nabla f(x_k)$, the gradient of
$f$ at $x_k$, and $B_k$ is a self-adjoint operator from $H$ into $H$ approximating $\nabla^2 f(x_k)$.

If $f_k \ne f(x_k)$, we must specify conditions on how much error is allowable. These conditions
will apply to the difference between successive function values rather than to errors in
the values themselves. Define the actual function reduction

\[ \mathrm{ared}_k(s_k) = f(x_k) - f(x_k + s_k), \tag{4} \]

the computed function reduction

\[ \mathrm{cred}_k(s_k) = f_k - f_{k+1}, \tag{5} \]

and the predicted function reduction

\[ \mathrm{pred}_k(s_k) = \psi_k(x_k) - \psi_k(x_k + s_k). \tag{6} \]

We then require two conditions to be satisfied at every iteration for some appropriately
chosen constants $\zeta_{f,1}$ and $\zeta_{f,2}$:

\[ |\mathrm{ared}_k(s_k) - \mathrm{cred}_k(s_k)| \le \zeta_{f,1}\, \mathrm{pred}_k(s_k), \tag{7} \]

and

\[ |\mathrm{ared}_k(s_k) - \mathrm{cred}_k(s_k)| \le \zeta_{f,2}\, |\mathrm{cred}_k(s_k)|. \tag{8} \]

Since direct estimates of $|\mathrm{ared}_k(s_k) - \mathrm{cred}_k(s_k)|$ are probably not available in most
applications, in practice (7) and (8) should be replaced by

\[ |f_{k+1} - f(x_{k+1})| + |f_k - f(x_k)| \le \zeta_{f,1}\, \mathrm{pred}_k(s_k) \tag{9} \]

and

\[ |f_{k+1} - f(x_{k+1})| + |f_k - f(x_k)| \le \zeta_{f,2}\, |\mathrm{cred}_k(s_k)|. \tag{10} \]
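As an illustration of how the practical tests (9) and (10) might be coded, the following is a minimal sketch, not part of the original algorithm statement. The function name `errors_admissible` and its arguments are hypothetical; `err_k` and `err_next` stand for user-supplied bounds on $|f_k - f(x_k)|$ and $|f_{k+1} - f(x_{k+1})|$, which are assumed to come from the simulation code.

```python
def errors_admissible(err_k, err_next, pred, cred, zeta_f1, zeta_f2):
    """Practical tests (9) and (10).

    err_k, err_next : assumed bounds on |f_k - f(x_k)| and |f_{k+1} - f(x_{k+1})|
    pred, cred      : predicted and computed reductions, (6) and (5)
    zeta_f1, zeta_f2: error control constants
    """
    total = err_k + err_next            # bounds |ared - cred| by the triangle inequality
    test_9 = total <= zeta_f1 * pred
    test_10 = total <= zeta_f2 * abs(cred)
    return test_9 and test_10


# Example with the constants suggested later in the text (0.3 and 0.99).
accepted = errors_admissible(0.1, 0.15, pred=1.0, cred=0.8, zeta_f1=0.3, zeta_f2=0.99)
rejected = errors_admissible(0.1, 0.25, pred=1.0, cred=0.8, zeta_f1=0.3, zeta_f2=0.99)
```

Because the sum of the two error bounds dominates the difference between the actual and computed reductions, passing both tests implies (7) and (8).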

If $g_k \ne \nabla f(x_k)$, we must similarly specify conditions on how much error is allowable in the
gradient. Define

\[ e_k = g_k - \nabla f(x_k). \tag{11} \]

We will show that the condition

\[ \frac{\langle e_k, g_k \rangle}{\langle g_k, g_k \rangle} \le \zeta_g \tag{12} \]

will lead to the global convergence result $\liminf_{k \to \infty} \|g_k\| = 0$ for an appropriately chosen
constant $\zeta_g$, while the stronger result $\lim_{k \to \infty} \|g_k\| = \lim_{k \to \infty} \|\nabla f(x_k)\| = 0$ can be obtained
by using the stronger condition

\[ \frac{\|e_k\|}{\|g_k\|} \le \zeta_g. \tag{13} \]
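To make the two gradient error measures concrete, here is a small sketch for the finite dimensional case $H = \mathbb{R}^n$. The helper names (`inner`, `projection_ratio`, `relative_error`) are illustrative assumptions, and the exact gradient is supplied only so that the quantities can be computed in a toy setting; in practice it is of course unknown.

```python
import math


def inner(u, v):
    """Euclidean inner product on R^n."""
    return sum(a * b for a, b in zip(u, v))


def projection_ratio(g, grad):
    """Left-hand side of (12): <e, g> / <g, g> with e = g - grad."""
    e = [gi - di for gi, di in zip(g, grad)]
    return inner(e, g) / inner(g, g)


def relative_error(g, grad):
    """Left-hand side of (13): ||e|| / ||g|| with e = g - grad."""
    e = [gi - di for gi, di in zip(g, grad)]
    return math.sqrt(inner(e, e) / inner(g, g))


# An approximation whose error is orthogonal to itself: (12) holds for any
# nonnegative constant, even though the relative error in (13) equals one.
grad = [1.0, 0.0]
g = [0.5, 0.5]
ratio_12 = projection_ratio(g, grad)
ratio_13 = relative_error(g, grad)
```

The example shows why (12) is the milder of the two conditions: an error orthogonal to the approximation contributes nothing to the ratio in (12).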


Our algorithm is structured as follows. 


Algorithm (1): Trust region method using inexact function and gradient evaluations.

Let the constants $0 < \eta_1 < \eta_2 < 1$ and $0 < \gamma_1 < 1 < \gamma_2$ be prespecified, and select the
error control constants $\zeta_{f,1}$, $\zeta_{f,2}$, and $\zeta_g$ such that

\[ \zeta_g + \zeta_{f,1} < 1 - \eta_2, \tag{14} \]

and

\[ \zeta_{f,2} < 1. \tag{15} \]

Select an initial guess $x_0 \in H$ and an initial trust radius $\Delta_0$. Compute initial function and
gradient values $f_0$ and $g_0$, and compute or initialize $B_0$.

For $k = 0, 1, \ldots$ until “convergence” do:

(a) Determine an approximate solution $s_k$ to problem (2).

(b) Calculate $\mathrm{pred}_k(s_k)$ and $\mathrm{cred}_k(s_k)$. If necessary, recompute $f_{k+1}$ and/or $f_k$ to
greater accuracy until (7) and (8) are satisfied.

(c) Compute the ratio

\[ \rho_k = \frac{\mathrm{cred}_k(s_k)}{\mathrm{pred}_k(s_k)}. \tag{16} \]

(d) If $\rho_k < \eta_1$, then set $\Delta_{k+1} \in (0, \gamma_1 \Delta_k]$,
else if $\rho_k < \eta_2$, then set $\Delta_{k+1} \in (0, \Delta_k]$,
else set $\Delta_{k+1} \in [\Delta_k, \gamma_2 \Delta_k]$.

(e) If $\rho_k < \eta_1$, then the current step is unacceptable. Set $x_{k+1} = x_k$.
Otherwise, the iteration is successful. Set $x_{k+1} = x_k + s_k$.

(f) If $x_{k+1} \ne x_k$, then compute $g_{k+1}$ and compute or update $B_{k+1}$.
Otherwise, retain the current values by setting $f_{k+1} = f_k$, $g_{k+1} = g_k$, $B_{k+1} = B_k$.

End Loop.

At this point, a number of comments should be made concerning the algorithm. 

1. The maximum error levels given by (14) and (15) are extremely mild. A typical value
for $\eta_2$ in an algorithm might be 0.1, in which case we could select $\zeta_g = 0.5$, $\zeta_{f,1} = 0.3$,
and $\zeta_{f,2} = 0.99$ so that we are allowed a relative error of one half in the gradient
approximation and a similarly large error in the difference between successive function
values.

2. Conditions (7), (8), (12) and (13) differ from the conditions used in [12], [19]
and [3]. All of these papers consider only the case $f_k = f(x_k)$. Instead of (12) and
(13), [12] uses the consistency condition

\[ \{x_k \to x_*\} \Rightarrow \lim_{k \to \infty} \|e_k\| = 0 \tag{17} \]

for the case $H = \mathbb{R}^n$, while [19] uses the condition

\[ \|e_k\| \le \min\{\kappa_1, \kappa_2 \Delta_k\} \tag{18} \]

in the very general setting of Hilbert space with simple bounds. Condition (13) is used
in [3], but the weaker condition (12) is not considered.


3. In practice, the trust region defined in (2) is often replaced by the scaled trust region
$\|D_k s\| \le \Delta_k$ for some invertible linear operator $D_k$. For simplicity we take $D_k$ to be
the identity in this paper, but note that our results can still be established provided
(12) and (13) are replaced by


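\[ \frac{\langle D_k^{-1} e_k, D_k^{-1} g_k \rangle}{\langle D_k^{-1} g_k, D_k^{-1} g_k \rangle} \le \zeta_g \tag{19} \]

and

\[ \frac{\|D_k^{-1} e_k\|}{\|D_k^{-1} g_k\|} \le \zeta_g \tag{20} \]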


respectively, and some restrictions are placed on the sequence of scalings $\{D_k\}$. A
complete treatment of (20) for the case $f_k = f(x_k)$ and $H = \mathbb{R}^n$ can be found in [3].


4. In algorithm (1), no requirement is made that g k be recomputed to greater accuracy 
following unsuccessful iterations. This is an important property, since even for the case 
where function and gradient values are known exactly, unsuccessful iterations are quite 
common and merely indicate that the trust radius should be adjusted. 

5. On the other hand, conditions (7) and (8) may require function values to be recomputed
to greater accuracy at any iteration. Although these conditions are therefore less
elegant than (12) and (13), they are still fairly natural and can be relatively easily
implemented if error in $f$ is controllable.

The optimal strategy for enforcing (7) and (8) will of course depend on such factors as
the reliability of error estimates and the amount of work involved in recomputing $f_k$
to a greater accuracy. For example, the following simple procedure could be used to
implement step (b) of Algorithm (1).

Procedure 2.

Let $\alpha \in (0,1)$ be prespecified. Given $x_k$, $s_k$, $\psi_k$, and an estimate for $|f_k - f(x_k)|$, do
the following.

(i) Calculate $\mathrm{pred}_k(s_k)$, and set emax $= \zeta_{f,1}\, \mathrm{pred}_k(s_k)$.

(ii) If necessary, recompute $f_k$ to greater accuracy so that the inequality
$|f_k - f(x_k)| \le (1 - \alpha)\,$emax holds.

(iii) Compute $f_{k+1}$ so that $|f_{k+1} - f(x_{k+1})| \le \alpha\,$emax.

(iv) Compute $\mathrm{cred}_k(s_k)$. If condition (10) is satisfied, then exit procedure, else reduce
emax and return to (ii).

End procedure.

Clearly, if error estimates are unreliable, $\zeta_{f,1}$ should be chosen fairly small. On the other
hand, $\zeta_{f,2}$ should usually be selected close to one to avoid unnecessary recomputations
of $f_k$ and $f_{k+1}$. If such recomputations are very expensive, one might consider taking
$\alpha$ close to one.
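Procedure 2 can be sketched as follows. This is only an illustration under stated assumptions: `eval_f(x, tol)` is a hypothetical user-supplied evaluator assumed to return a value of $f(x)$ with absolute error at most `tol`, the reduction factor `shrink` for emax is an arbitrary choice, and the cap `max_tries` is added here so that the loop cannot run forever when the computed reduction is essentially zero.

```python
def procedure2(eval_f, x_k, x_next, pred, f_k, err_k, zeta_f1, zeta_f2,
               alpha=0.5, shrink=0.5, max_tries=30):
    """Sketch of Procedure 2; returns (f_k, f_next, cred)."""
    emax = zeta_f1 * pred                          # step (i)
    for _ in range(max_tries):
        if err_k > (1.0 - alpha) * emax:           # step (ii)
            err_k = (1.0 - alpha) * emax
            f_k = eval_f(x_k, err_k)
        err_next = alpha * emax                    # step (iii)
        f_next = eval_f(x_next, err_next)
        cred = f_k - f_next                        # step (iv)
        # (9) holds by construction since err_k + err_next <= emax;
        # only (10) has to be tested explicitly.
        if err_k + err_next <= zeta_f2 * abs(cred):
            return f_k, f_next, cred
        emax *= shrink
    raise RuntimeError("condition (10) could not be satisfied")


# Toy evaluator for f(x) = x^2 whose error is 90% of the requested tolerance.
def toy_eval(x, tol):
    return x * x + 0.9 * tol


f_k, f_next, cred = procedure2(toy_eval, 2.0, 1.0, pred=2.5, f_k=4.5, err_k=1.0,
                               zeta_f1=0.3, zeta_f2=0.99)
```

In this toy run the exact reduction is 3, and the computed reduction agrees with it because both evaluations carry the same error.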

3 Convergence Results 

In order to establish our results, we will need several properties regarding our trial steps $s_k$.
These properties are

\[ \mathrm{pred}_k(s_k) \ge \tfrac{1}{2} c_1 \|g_k\| \min\{\Delta_k, \|g_k\|/c_2\}, \tag{21} \]

\[ \|s_k\| \le c_3 \Delta_k, \tag{22} \]

and

\[ \Bigl\{ \liminf_{k \to \infty} \|g_k\| > 0 \text{ and } \lim_{k \to \infty} \Delta_k = 0 \Bigr\} \Rightarrow \lim_{k \to \infty} \frac{\langle -g_k, s_k \rangle}{\|g_k\|\,\|s_k\|} = 1 \tag{23} \]

for some constants $c_1 \in (0,1)$, $c_2 \in (0,\infty)$, and $c_3 \in [1,2]$. Obviously, (21), (22) and (23) are
directly dependent on both the methods used to compute trial steps and the properties of
the sequence of quadratic models chosen. For the special case $H = \mathbb{R}^n$ a few comments are
in order.

Condition (21) is well known (see, for example, [18]) and is usually established by assuming
an upper bound of the form

\[ \|B_k\| \le c_2. \tag{24} \]

However, (21) can also be established [2] given an upper bound of the form

\[ \langle B_k g_k, g_k \rangle \le c_2 \langle g_k, g_k \rangle. \tag{25} \]

Condition (23) can be interpreted geometrically as stating that the steps $s_k$ tend toward the
direction $-g_k$ as $\|s_k\|/\|g_k\|$ goes to zero. This property is established in [3] by assuming an
upper bound on $\{\|B_k\|\}$, but can also be established using the milder condition (25) for one
of the popular classes of techniques for computing trial steps (generalized dogleg methods).
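As a concrete instance of the decrease condition (21), the sketch below computes the predicted reduction of the Cauchy step (the minimizer of the model along $-g_k$ inside the trust region) for a small matrix $B_k$ in $\mathbb{R}^2$ and compares it with the right-hand side of (21). The Cauchy step is used here only as a familiar example of a step satisfying (21); the helper names and the sample data are assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(B, v):
    return [dot(row, v) for row in B]


def cauchy_pred(g, B, delta):
    """Predicted reduction of the step s = -t g with ||s|| <= delta."""
    gg = dot(g, g)
    gBg = dot(matvec(B, g), g)
    t = delta / math.sqrt(gg)          # step length reaching the boundary
    if gBg > 0.0:
        t = min(t, gg / gBg)           # interior minimizer along -g
    return t * gg - 0.5 * t * t * gBg


def lower_bound_21(g, delta, c1, c2):
    """Right-hand side of (21)."""
    norm_g = math.sqrt(dot(g, g))
    return 0.5 * c1 * norm_g * min(delta, norm_g / c2)


# Indefinite B with <B g, g> <= c2 <g, g> for c2 = 2, as in (25).
B = [[2.0, 0.0], [0.0, -1.0]]
g = [1.0, 1.0]
checks = [cauchy_pred(g, B, d) >= lower_bound_21(g, d, c1=0.99, c2=2.0)
          for d in (0.01, 0.5, 1.0, 10.0)]
```

Only the curvature along $g_k$ enters the computation, which is why a bound of the form (25) is enough for this particular step.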

In the context of inexact function and gradient values, assumptions such as (24) or (25) 
are quite reasonable: if first order information is not known accurately it is only natural to 
directly enforce an upper limit on our approximation to second order information. In this 
paper, however, we make no assumptions about how (21), (22), and (23) are obtained. The 
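only assumption we directly use concerning the sequence $\{B_k\}$ is that

\[ -c_4 \langle s, s \rangle \le \langle B_k s, s \rangle \tag{26} \]

at every iteration for all $s \in H$ and some constant $c_4$. First note the following simple results.

Lemma 3.1. Let $f$ satisfy assumptions A.1 and A.3 and let $c_5$ be the Lipschitz constant
associated with $\nabla f$. Then we have

\[ \mathrm{pred}_k(s) - \mathrm{ared}_k(s) \le \tfrac{1}{2}(c_4 + c_5)\|s\|^2 - \langle e_k, s \rangle. \tag{27} \]

Proof. Using an integral representation of $\mathrm{pred}_k(s) - \mathrm{ared}_k(s)$, we have

\[
\begin{aligned}
\mathrm{pred}_k(s) - \mathrm{ared}_k(s)
&= -\langle g_k, s \rangle - \tfrac{1}{2}\langle B_k s, s \rangle + \int_0^1 \langle \nabla f(x_k + \lambda s), s \rangle \, d\lambda \\
&= -\langle e_k, s \rangle - \tfrac{1}{2}\langle B_k s, s \rangle + \int_0^1 \langle \nabla f(x_k + \lambda s) - \nabla f(x_k), s \rangle \, d\lambda.
\end{aligned}
\tag{28}
\]

Using (26), the Cauchy-Schwarz inequality, and the Lipschitz continuity of $\nabla f$, we have

\[
\begin{aligned}
\mathrm{pred}_k(s) - \mathrm{ared}_k(s)
&\le -\langle e_k, s \rangle + \tfrac{1}{2} c_4 \|s\|^2 + \int_0^1 \|\nabla f(x_k + \lambda s) - \nabla f(x_k)\| \, \|s\| \, d\lambda \\
&\le -\langle e_k, s \rangle + \tfrac{1}{2} c_4 \|s\|^2 + \int_0^1 c_5 \|\lambda s\| \, \|s\| \, d\lambda,
\end{aligned}
\tag{29}
\]

which immediately establishes (27). □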

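Lemma 3.2. Let $f$ satisfy assumptions A.1, A.2, and A.3, and let $\Omega_0$ be the interior of the
level set of $f$ at $x_0$. We then have that $\nabla f$ is bounded on $\Omega_0$.

Proof. Let $c_5$ be the Lipschitz constant associated with $\nabla f$, and let $c_6$ be such that
$f(x) > c_6$ for all $x \in \Omega_0$. Now, suppose $\nabla f$ is unbounded on $\Omega_0$ so that there exists
$x \in \Omega_0$ with $\|\nabla f(x)\|^2 > 8 c_5 (f(x_0) - c_6)$. Define $s = -\frac{\alpha}{4 c_5} \nabla f(x)$. For all $\alpha$
sufficiently small that $x + s \in \Omega$, we have

\[
\begin{aligned}
f(x) - f(x + s)
&= -\int_0^1 \langle \nabla f(x), s \rangle \, d\lambda - \int_0^1 \langle \nabla f(x + \lambda s) - \nabla f(x), s \rangle \, d\lambda \\
&\ge \frac{\alpha}{4 c_5} \|\nabla f(x)\|^2 - \tfrac{1}{2} c_5 \|s\|^2 \\
&\ge \frac{\alpha}{4 c_5} \|\nabla f(x)\|^2 \Bigl(1 - \frac{\alpha}{2}\Bigr) \\
&> 2\alpha \Bigl(1 - \frac{\alpha}{2}\Bigr)(f(x_0) - c_6).
\end{aligned}
\tag{30}
\]

Now, the final term in (30) is positive for all $\alpha \in (0,2)$ and hence $x + s \in \Omega_0$ for $\alpha = 1$. But
this leads to the contradiction $f(x) - f(x + s) > f(x_0) - c_6$, so $\nabla f$ cannot be unbounded on
$\Omega_0$. □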

We now establish that $\{g_k\}$ is not bounded away from zero.

Theorem 3.3. Let $f$ satisfy assumptions A.1, A.2, and A.3, let the steps generated in
Algorithm (1) satisfy (21), (22), and (23), let $\{B_k\}$ satisfy (26), and let the function evaluations
satisfy (7) and (8). Then our algorithm generates a sequence of iterates satisfying

\[ \liminf_{k \to \infty} \|g_k\| = 0 \tag{31} \]

provided condition (12) holds.

Proof. Let $K_s$ denote the set of successful iterations. First notice that $\mathrm{pred}_k(s_k) \ge 0$ and,
from (16), $\mathrm{cred}_k(s_k) \ge \eta_1\, \mathrm{pred}_k(s_k)$ for all $k \in K_s$. From (8) we have

\[ (1 - \zeta_{f,2})\, \mathrm{cred}_k(s_k) \le \mathrm{ared}_k(s_k) \le (1 + \zeta_{f,2})\, \mathrm{cred}_k(s_k). \tag{32} \]

Combining (32) with (16) and (21) yields

\[
\begin{aligned}
\mathrm{ared}_k(s_k) &\ge (1 - \zeta_{f,2})\, \eta_1\, \mathrm{pred}_k(s_k) \\
&\ge \tfrac{1}{2} c_1 \eta_1 (1 - \zeta_{f,2}) \|g_k\| \min\{\Delta_k, \|g_k\|/c_2\}
\end{aligned}
\tag{33}
\]

for all $k \in K_s$.

Next, define $\theta_k$ such that

\[ \cos\theta_k = \frac{\langle -g_k, s_k \rangle}{\|g_k\|\,\|s_k\|} \tag{34} \]

and $w_k \in H$ such that $w_k = 0$ if $\sin\theta_k = 0$ and

\[ w_k = \frac{1}{\sin\theta_k} \Bigl( \frac{s_k}{\|s_k\|} + \cos\theta_k \frac{g_k}{\|g_k\|} \Bigr) \tag{35} \]

otherwise. Notice that $\langle g_k, w_k \rangle = 0$, $\langle e_k, w_k \rangle = \langle -\nabla f(x_k), w_k \rangle$, and $\|w_k\| = 1$ for
$\sin\theta_k \ne 0$, and that

\[ s_k = \|s_k\| \Bigl( -\cos\theta_k \frac{g_k}{\|g_k\|} + \sin\theta_k\, w_k \Bigr). \tag{36} \]

Now, from (7) and (16) and the fact that $\mathrm{pred}_k(s_k) \ge 0$, we have

\[
\begin{aligned}
1 - \rho_k &= \frac{\mathrm{pred}_k(s_k) - \mathrm{cred}_k(s_k)}{\mathrm{pred}_k(s_k)} \\
&= \frac{\mathrm{pred}_k(s_k) - \mathrm{ared}_k(s_k) + \mathrm{ared}_k(s_k) - \mathrm{cred}_k(s_k)}{\mathrm{pred}_k(s_k)} \\
&\le \frac{\mathrm{pred}_k(s_k) - \mathrm{ared}_k(s_k)}{\mathrm{pred}_k(s_k)} + \zeta_{f,1}.
\end{aligned}
\tag{37}
\]
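Using (3), (6), (26), and (27) gives

\[
\begin{aligned}
1 - \rho_k &\le \zeta_{f,1} + \frac{\tfrac{1}{2}(c_4 + c_5)\|s_k\|^2 - \langle e_k, s_k \rangle}{-\langle g_k, s_k \rangle - \tfrac{1}{2}\langle B_k s_k, s_k \rangle} \\
&\le \zeta_{f,1} + \frac{-\langle e_k, s_k \rangle + \tfrac{1}{2}(c_4 + c_5)\|s_k\|^2}{-\langle g_k, s_k \rangle + \tfrac{1}{2} c_4 \|s_k\|^2}.
\end{aligned}
\tag{38}
\]

Substituting (36) into (38) yields

\[
\begin{aligned}
1 - \rho_k &\le \zeta_{f,1} + \frac{\frac{\|s_k\|}{\|g_k\|} \cos\theta_k \langle e_k, g_k \rangle - \|s_k\| \sin\theta_k \langle e_k, w_k \rangle + \tfrac{1}{2}(c_4 + c_5)\|s_k\|^2}{\frac{\|s_k\|}{\|g_k\|} \cos\theta_k \langle g_k, g_k \rangle - \|s_k\| \sin\theta_k \langle g_k, w_k \rangle + \tfrac{1}{2} c_4 \|s_k\|^2} \\
&= \zeta_{f,1} + \frac{\cos\theta_k \frac{\langle e_k, g_k \rangle}{\langle g_k, g_k \rangle} + \frac{\sin\theta_k}{\|g_k\|} \langle \nabla f(x_k), w_k \rangle + \tfrac{1}{2}(c_4 + c_5) \frac{\|s_k\|}{\|g_k\|}}{\cos\theta_k + \tfrac{1}{2} c_4 \frac{\|s_k\|}{\|g_k\|}}.
\end{aligned}
\tag{39}
\]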

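Now, suppose $\liminf_{k \to \infty} \|g_k\| > 0$. Since $f$ is bounded below, (33) implies that
$\lim_{k \in K_s} \Delta_k = 0$ and hence $\lim_{k \to \infty} \Delta_k = 0$. But if $\lim_{k \to \infty} \Delta_k = 0$, from (23) we have
$\lim_{\Delta_k \to 0} \cos\theta_k = 1$ and $\lim_{\Delta_k \to 0} \sin\theta_k = 0$. By Lemma 3.2 and the Cauchy-Schwarz
inequality, $\langle \nabla f(x_k), w_k \rangle$ is bounded, and hence from (14) and (22) we have

\[
\begin{aligned}
\lim_{\Delta_k \to 0} (1 - \rho_k)
&\le \zeta_{f,1} + \lim_{\Delta_k \to 0} \frac{\cos\theta_k \frac{\langle e_k, g_k \rangle}{\langle g_k, g_k \rangle} + \frac{\sin\theta_k}{\|g_k\|} \langle \nabla f(x_k), w_k \rangle + \tfrac{1}{2}(c_4 + c_5) \frac{c_3 \Delta_k}{\|g_k\|}}{\cos\theta_k + \tfrac{1}{2} c_4 \frac{c_3 \Delta_k}{\|g_k\|}} \\
&\le \zeta_{f,1} + \lim_{\Delta_k \to 0} \frac{\langle e_k, g_k \rangle}{\langle g_k, g_k \rangle} \\
&\le \zeta_{f,1} + \zeta_g < 1 - \eta_2,
\end{aligned}
\tag{40}
\]

and hence $\rho_k > \eta_2$ for all sufficiently large $k$. But this is a contradiction since $\rho_k > \eta_2$ implies
$\Delta_{k+1} \ge \Delta_k$. Hence, $\liminf_{k \to \infty} \|g_k\|$ cannot be greater than zero. □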

The final result of this section uses the stronger error condition (13) to establish the
stronger result $\lim_{k \to \infty} \|\nabla f(x_k)\| = 0$.

Theorem 3.4. Let $f$ satisfy assumptions A.1, A.2, and A.3, let the steps generated in
Algorithm (1) satisfy (21) and (22), and let the function evaluations satisfy (8). Then (13)
and (31) imply

\[ \lim_{k \to \infty} \|g_k\| = \lim_{k \to \infty} \|\nabla f(x_k)\| = 0. \tag{41} \]

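Proof. First note that from (13) with any $\zeta_g < 1$ we can immediately obtain the equivalences
$(\liminf_{k \to \infty} \|g_k\| = 0) \Leftrightarrow (\liminf_{k \to \infty} \|\nabla f(x_k)\| = 0)$ and
$(\lim_{k \to \infty} \|g_k\| = 0) \Leftrightarrow (\lim_{k \to \infty} \|\nabla f(x_k)\| = 0)$. Define $\epsilon = \tfrac{1}{2}(1 - \zeta_g)/(1 + \zeta_g)$. Since
$\liminf_{k \to \infty} \|g_k\| = 0$, for any $m$ with $\|g_m\| \ne 0$ there exists $\bar{m} \ge m$ for which
$\|g_{\bar{m}+1}\| < \epsilon \|g_m\|$ and $\|g_k\| \ge \epsilon \|g_m\|$ for all $k \in [m, \bar{m}]$. Using (22) and (33) we have

\[
\begin{aligned}
f(x_m) - f(x_{\bar{m}+1}) &\ge \sum_{k=m}^{\bar{m}} \mathrm{ared}_k(s_k) \\
&\ge \sum_{k=m}^{\bar{m}} \tfrac{1}{2} c_1 \eta_1 (1 - \zeta_{f,2}) \|g_k\| \min\{\Delta_k, \|g_k\|/c_2\} \\
&\ge \tfrac{1}{2} c_1 \eta_1 (1 - \zeta_{f,2})\, \epsilon \|g_m\| \sum_{k=m}^{\bar{m}} \min\Bigl\{ \frac{\|s_k\|}{c_3}, \frac{\epsilon \|g_m\|}{c_2} \Bigr\}.
\end{aligned}
\tag{42}
\]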

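From the triangle inequality we have

\[
\begin{aligned}
\|g_m\| &\le \|g_m - g_{\bar{m}+1}\| + \|g_{\bar{m}+1}\| \\
&\le \|g_m - g_{\bar{m}+1}\| + \epsilon \|g_m\|.
\end{aligned}
\tag{43}
\]

Rearranging terms, substituting $g_k = e_k + \nabla f(x_k)$, and again applying the triangle inequality,
we have

\[
\begin{aligned}
(1 - \epsilon)\|g_m\| &\le \|g_m - g_{\bar{m}+1}\| \\
&\le \|\nabla f(x_m) - \nabla f(x_{\bar{m}+1})\| + \|e_m\| + \|e_{\bar{m}+1}\| \\
&\le \sum_{k=m}^{\bar{m}} \|\nabla f(x_{k+1}) - \nabla f(x_k)\| + \|e_m\| + \|e_{\bar{m}+1}\| \\
&\le c_5 \sum_{k=m}^{\bar{m}} \|s_k\| + \|e_m\| + \|e_{\bar{m}+1}\|.
\end{aligned}
\tag{44}
\]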


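Using (13) in this equation with $\epsilon = \tfrac{1}{2}(1 - \zeta_g)/(1 + \zeta_g)$ and $\|g_{\bar{m}+1}\| < \epsilon \|g_m\|$ yields

\[
\begin{aligned}
\sum_{k=m}^{\bar{m}} \|s_k\|
&\ge \frac{1}{c_5} \bigl[ (1 - \epsilon)\|g_m\| - \|e_m\| - \|e_{\bar{m}+1}\| \bigr] \\
&\ge \frac{\|g_m\|}{c_5} \Bigl[ (1 - \epsilon) - \frac{\|e_m\|}{\|g_m\|} - \frac{\|e_{\bar{m}+1}\|}{\|g_{\bar{m}+1}\|} \frac{\|g_{\bar{m}+1}\|}{\|g_m\|} \Bigr] \\
&\ge \frac{\|g_m\|}{c_5} \bigl[ 1 - \epsilon - \zeta_g - \zeta_g \epsilon \bigr] \\
&\ge \frac{\|g_m\|}{c_5} \bigl[ 1 - \zeta_g - \epsilon (1 + \zeta_g) \bigr] \\
&\ge \frac{\|g_m\|}{c_5} \, \tfrac{1}{2}(1 - \zeta_g).
\end{aligned}
\tag{45}
\]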
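Substituting into (42) gives

\[ f(x_m) - f(x_{\bar{m}+1}) \ge \epsilon_1 \|g_m\|^2, \tag{46} \]

where

\[ \epsilon_1 = \tfrac{1}{2} c_1 \eta_1 (1 - \zeta_{f,2})\, \epsilon \min\Bigl\{ \frac{1 - \zeta_g}{2 c_3 c_5}, \frac{\epsilon}{c_2} \Bigr\}, \tag{47} \]

and hence

\[ \|g_m\|^2 \le \frac{1}{\epsilon_1} \bigl( f(x_m) - f(x_{\bar{m}+1}) \bigr). \tag{48} \]

Now, $\{f(x_k)\}$ is nonincreasing and bounded below so that $f(x_k) \to f_*$ for some $f_*$. Hence for
any $m$, either $g_m = 0$ or $\|g_m\|^2 \le \frac{1}{\epsilon_1}(f(x_m) - f_*)$, and $\lim_{k \to \infty} \|g_k\| = \lim_{k \to \infty} \|\nabla f(x_k)\| = 0$.

□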

4 Implementing the gradient error conditions 

In terms of global convergence results, (13) is clearly superior to (12) and should be en- 
forced whenever possible. The availability of error estimates will of course depend on the 
application, but we point to the increasing availability and use of high quality software 
such as ODEPACK [7] which allows the user to prespecify desired levels of accuracy in each
component of the differential equation being solved. 





Condition (12), although leading to the weaker convergence result $\liminf_{k \to \infty} \|g_k\| = 0$,
has a number of interesting properties.

First notice that if $g_k$ is a Galerkin approximation to $\nabla f(x_k)$, condition (12) is automatically
satisfied since $\langle e_k, g_k \rangle = 0$. Second, (12) is a much milder condition than (13) unless
$\zeta_g$ is close to one. For example, define

\[ S_r(\zeta_g, \nabla f(x_k)) = \{ g_k : \|e_k\| \le \zeta_g \|g_k\| \} \tag{49} \]

and

\[ S_p(\zeta_g, \nabla f(x_k)) = \{ g_k : \langle e_k, g_k \rangle \le \zeta_g \langle g_k, g_k \rangle \}. \tag{50} \]

These sets can be interpreted geometrically for $\nabla f \in \mathbb{R}^2$. Figure 1 shows $S_r$ for a variety
of values of $\zeta_g$ while Figure 2 shows $S_p$ for the same values of $\zeta_g$. Figures 3 and 4 directly
compare $S_r$ and $S_p$ for $\zeta_g = 0.1$ and $0.5$. For small values of $\zeta_g$, condition (12) is significantly
milder than (13). For larger values of $\zeta_g$, the difference is less pronounced.
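The difference between the two sets can also be seen with a short membership test in $\mathbb{R}^2$. The following sketch is an illustration only; the function names `in_S_r` and `in_S_p` and the sample vectors are assumptions, and the exact gradient is supplied directly because this is a toy setting.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_S_r(g, grad, zeta):
    """Membership in the set (49): ||e|| <= zeta ||g|| with e = g - grad."""
    e = [gi - di for gi, di in zip(g, grad)]
    return math.sqrt(dot(e, e)) <= zeta * math.sqrt(dot(g, g))


def in_S_p(g, grad, zeta):
    """Membership in the set (50): <e, g> <= zeta <g, g> with e = g - grad."""
    e = [gi - di for gi, di in zip(g, grad)]
    return dot(e, g) <= zeta * dot(g, g)


grad = [1.0, 0.0]
poor = [0.6, 0.4]       # large error, but the error points away from g
good = [1.05, 0.0]      # small error along the gradient direction
results = {
    "poor": (in_S_r(poor, grad, 0.1), in_S_p(poor, grad, 0.1)),
    "good": (in_S_r(good, grad, 0.1), in_S_p(good, grad, 0.1)),
}
```

By the Cauchy-Schwarz inequality every element of the set (49) also belongs to the set (50) for the same constant, so the second set is never smaller than the first.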
Figure 3: $S_p$ and $S_r$ for $\zeta_g = 0.1$.

Third, consider that (12) can be written as

\[ 1 - \frac{\langle \nabla f(x_k), g_k \rangle}{\langle g_k, g_k \rangle} \le \zeta_g. \tag{51} \]


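But $\langle \nabla f(x_k), g_k \rangle$ is simply the Gateaux differential of $f$ at $x_k$ with increment $g_k$:

\[ \langle \nabla f(x_k), g_k \rangle = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \bigl[ f(x_k + \epsilon g_k) - f(x_k) \bigr]. \tag{52} \]

Hence, if error estimates are not available from other sources, $\langle \nabla f(x_k), g_k \rangle$ may be
approximated by a finite difference formula and substituted into (51). Moreover, if the estimate
for $\langle \nabla f(x_k), g_k \rangle$ is sufficiently accurate, we can improve our gradient approximation $g_k$ by
replacing it with the scaled gradient

\[ \hat{g}_k = \frac{\langle \nabla f(x_k), g_k \rangle}{\langle g_k, g_k \rangle}\, g_k \tag{53} \]

so that, with $\hat{e}_k = \hat{g}_k - \nabla f(x_k)$,

\[ \frac{\langle \hat{e}_k, \hat{g}_k \rangle}{\langle \hat{g}_k, \hat{g}_k \rangle} = 1 - \frac{\langle \nabla f(x_k), \hat{g}_k \rangle}{\langle \hat{g}_k, \hat{g}_k \rangle} = 0. \tag{54} \]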
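The scaling step can be illustrated in a few lines. In this sketch the directional derivative $\langle \nabla f(x_k), g_k \rangle$ is passed in as a number (`dir_deriv`), since in practice it would come from a finite difference or a direct computation; the function name and the sample vectors are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def scaled_gradient(g, dir_deriv):
    """Rescale g so that its error becomes orthogonal to it.

    dir_deriv is an estimate of <grad f(x), g>.
    """
    scale = dir_deriv / dot(g, g)
    return [scale * gi for gi in g]


# Toy check with a known gradient.
grad = [1.0, 0.0]
g = [1.0, 1.0]
g_hat = scaled_gradient(g, dot(grad, g))
e_hat = [gh - d for gh, d in zip(g_hat, grad)]
orthogonality = dot(e_hat, g_hat)     # zero up to rounding
```

Only the length of the approximation changes; its direction is kept, so the correction cannot repair an approximation that points the wrong way.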
Figure 5: Gradient correction through the projection operation (53)

This approach seems particularly attractive in the event that $H = \mathbb{R}^n$, the $f_k$ values are
known very accurately, and $g_k$ is being approximated by finite differences. After computing
a first approximation to $g_k$ by, say, forward differences using $n$ extra function evaluations,
a central difference approximation to $\langle \nabla f(x_k), g_k \rangle$ using two additional function evaluations
can be computed to validate the accuracy of $g_k$. If (12) is violated, then the algorithm can
henceforth use central difference approximations to $g_k$. Notice that the normally difficult
problem of selecting an appropriate perturbation size $\epsilon$ in the finite difference procedure is
simple in this case, since $\langle g_k, g_k \rangle$ can be used as a rough scale estimate for $\langle \nabla f(x_k), g_k \rangle$.
Following Dennis and Schnabel [5] we write

\[ \langle \nabla f(x_k), g_k \rangle \approx \frac{1}{2\epsilon} \bigl( f(x_k + \epsilon g_k) - f(x_k - \epsilon g_k) \bigr) \tag{55} \]

and choose an $\epsilon$ that we expect will perturb two-thirds of the accurate digits of $f$. If $\alpha$ is the
relative error in function evaluations, we want $f(x_k + \epsilon g_k) - f_k \approx \alpha^{1/3} |f_k|$, and an appropriate
$\epsilon$ is then

\[ \epsilon = \alpha^{1/3} |f_k| / \langle g_k, g_k \rangle. \tag{56} \]
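A possible implementation of this test in $\mathbb{R}^n$ is sketched below. The function names are hypothetical, `rel_err` plays the role of the relative error in function evaluations, and the sketch assumes $f_k \ne 0$ and $g_k \ne 0$ so that the perturbation size is well defined and nonzero.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def directional_derivative(f, x, g, f_x, rel_err):
    """Central difference estimate of <grad f(x), g>, as in (55).

    The perturbation size follows (56); f_x is the current function value.
    Assumes f_x != 0 and g != 0.
    """
    eps = rel_err ** (1.0 / 3.0) * abs(f_x) / dot(g, g)
    x_plus = [xi + eps * gi for xi, gi in zip(x, g)]
    x_minus = [xi - eps * gi for xi, gi in zip(x, g)]
    return (f(x_plus) - f(x_minus)) / (2.0 * eps)


def condition_12_lhs(f, x, g, f_x, rel_err):
    """Estimate of the left-hand side of (12), written as 1 - <grad f, g>/<g, g>."""
    return 1.0 - directional_derivative(f, x, g, f_x, rel_err) / dot(g, g)


def f(x):
    return 1.0 + x[0] ** 2 + 2.0 * x[1] ** 2


x = [1.0, 1.0]                 # exact gradient here is [2, 4]
g = [2.2, 3.8]                 # an inexact gradient approximation
lhs = condition_12_lhs(f, x, g, f(x), rel_err=1e-15)
```

The returned value can be compared directly with the chosen gradient error constant, at the cost of two extra function evaluations.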


In addition to the possibility of estimating $\langle \nabla f(x_k), g_k \rangle$ via (55), some applications may
admit a direct computation of this quantity. Whenever the action of $\nabla f(x_k)$ on a single vector
can be computed with less expense and/or more accuracy than $g_k$, such an approach should
be considered.

We wish to reemphasize that the convergence results using (12) are significantly weaker
than the results using (13), and that (13) should be used whenever possible. If (12) is used,
then whenever $g_k$ becomes sufficiently close to zero to trigger the convergence tests used by
the algorithm, it should be recomputed to the maximum attainable accuracy to affirm that
$\nabla f(x_k)$ has also converged to zero.

5 Extensions To Theory 

The conditions for allowable error in our algorithm have been formulated to be as simple and 
lucid as possible, and we again point out that the upper limits (14) and (15) are exceptionally 
mild. Due to the very broad range of potential applications, however, we should consider 
whether any of the details of our theory can be further relaxed. 

One extension that can be made to our theory is to enforce (7) only in expectation rather
than at every iteration. Similarly, condition (8) could be enforced only in the limit provided
the set $\Omega$ is large enough to include all the iterates (some of which may be uphill from $f(x_0)$).
Such a result is quite reasonable in that function values are only used to update the trust
radius $\Delta_k$, and mistakes in this procedure can be tolerated as long as they balance out in the
long run. Cognizant of the practical fact that computed error estimates in a simulation may
occasionally be very poor, such a stochastic theory might seem attractive, but we prefer not 
to include it in this paper because it adds little insight to the analysis. 

In contrast to the situation for function evaluations, a single sufficiently bad gradient
evaluation can cause the algorithm to fail. For example, if $g_k = -\nabla f(x_k)$ at some iteration $k$,
then our algorithm can decrease $\Delta_k$ indefinitely without ever finding an acceptable step. One
might speculate that a condition such as $\langle g_k, \nabla f(x_k) \rangle > 0$ might be sufficient to guarantee
an acceptable step will eventually be found (assuming for the moment that function values
are computed exactly), but this turns out to be slightly too general a condition. As pointed
out in [3], the approximation $g_k = \frac{1}{\eta_1} \nabla f(x_k)$ may also cause the algorithm to fail. What
then is the most general condition guaranteeing an acceptable step will always be found?
Taking $g_k$, $\nabla f(x_k)$, and $B_k$ to be fixed, taking $\zeta_{f,1} = \zeta_{f,2} = 0$, and following the same general
approach as the proof of Theorem 3.3, it is straightforward to establish that
$\lim_{\Delta_k \to 0} (1 - \rho_k) = \langle e_k, g_k \rangle / \langle g_k, g_k \rangle$. Hence our algorithm is assured of finding an acceptable
step for sufficiently small $\Delta_k$ if and only if $g_k$ is in the interior of $S_p(1 - \eta_1, \nabla f(x_k))$. Note that
$S_p(1 - \eta_1, \nabla f(x_k))$ is a sphere with center $\frac{1}{2\eta_1} \nabla f(x_k)$ and diameter $\frac{1}{\eta_1} \|\nabla f(x_k)\|$, and recall
that a typical value for $\eta_1$ is 0.001. Our “worst case” limit for a single bad gradient evaluation is
therefore only slightly more restrictive (from a practical point of view) than the requirement
$\langle g_k, \nabla f(x_k) \rangle > 0$. Of course, one should still strive for a more accurate approximation satisfying
$g_k \in S_p(1 - \eta_2, \nabla f(x_k))$ or $g_k \in S_r(1 - \eta_2, \nabla f(x_k))$, but it is reassuring to know that the algorithm
has such a large margin for recovery from occasional mistakes in gradient error control.
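The sphere description can be verified by completing the square. The short derivation below is a sketch added for the reader's convenience; it writes $\zeta$ for the constant in the definition of $S_p$ and assumes $0 \le \zeta < 1$.

```latex
\begin{aligned}
\langle g_k - \nabla f(x_k),\, g_k \rangle \le \zeta \langle g_k, g_k \rangle
&\iff (1 - \zeta)\|g_k\|^2 - \langle \nabla f(x_k), g_k \rangle \le 0 \\
&\iff \Bigl\| g_k - \tfrac{1}{2(1 - \zeta)} \nabla f(x_k) \Bigr\|^2
      \le \tfrac{1}{4(1 - \zeta)^2} \|\nabla f(x_k)\|^2 .
\end{aligned}
```

Setting $\zeta = 1 - \eta_1$ gives a ball centered at $\frac{1}{2\eta_1} \nabla f(x_k)$ with radius $\frac{1}{2\eta_1} \|\nabla f(x_k)\|$. For $\eta_1 = 0.001$ its diameter is $1000\, \|\nabla f(x_k)\|$, and the origin lies on its boundary, so the ball fills most of the half space $\langle g_k, \nabla f(x_k) \rangle > 0$ near the true gradient.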

A final slight extension to our theory comes from the observation that Theorem 3.4 is
proven using only the bound $\zeta_g < 1$ rather than $\zeta_g < 1 - \eta_2$. Hence, we could obtain our
same strong convergence results $\lim_{k \to \infty} \|g_k\| = \lim_{k \to \infty} \|\nabla f(x_k)\| = 0$ using two different
bounds in (12) and (13). That is, we could require $g_k \in S_p(\zeta_{g,1}, \nabla f(x_k)) \cap S_r(\zeta_{g,2}, \nabla f(x_k))$
with $\zeta_{g,1} < 1 - \eta_2$ and $\zeta_{g,2} < 1$. This is a slightly larger set of admissible approximations
than $S_r(\zeta_{g,1}, \nabla f(x_k))$.

6 Summary 

Using the error conditions (7), (8), and (12), we have established the result $\liminf_{k \to \infty} \|g_k\| = 0$
for our trust region algorithm using very mild assumptions. In particular, the requirement
that $\zeta_g + \zeta_{f,1} < 1 - \eta_2$ is exceptionally generous, as $\zeta_g = 0.5$ would typically correspond
to only one significant bit in each component of $g_k$ for $H = \mathbb{R}^n$. Condition (12) may also
be automatically satisfied if $g_k$ is computed by a projection technique such as a Galerkin
method.

If error estimates for $g_k$ are not available through other means, (12) can be evaluated
through a one-dimensional finite difference test such as (55). In some applications, a separate
numerical formulation might allow the action of $\nabla f$ on $g_k$ to be computed directly. Besides
allowing us to evaluate error condition (12), such approaches may allow us to improve each
approximate gradient using (53) provided our estimate of $\langle \nabla f(x_k), g_k \rangle$ is accurate enough.

The stronger convergence result $\lim_{k \to \infty} \|g_k\| = \lim_{k \to \infty} \|\nabla f(x_k)\| = 0$ can be obtained by
using condition (13) rather than (12). We recommend that this condition be used whenever
possible. If a scaled version of the trust region algorithm is being used, gradient errors must
be measured in the norm induced by the rescaling.